In [ ]:
from google.colab import drive
drive.mount("/content/gdrive/")
%cd /content/gdrive/MyDrive/data_science/final_project/
!ls
Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).
/content/gdrive/MyDrive/data_science/final_project
'Copy of milestone2.ipynb'   final.ipynb   revenue.xlsx

Stocks trading recommendation strategies project¶

By Tianyi Xu(https://xutianyi01.github.io/)¶

Performing an analysis on the historical and earnings data of over 50 tech companies , visualizing trends, and patterns. Time period is from 2010 to 2023. The goal is to find the most crucial features to do best trading recommendation strategies or stock price prediction.

Project Framework¶

  1. Introduction
  2. Data Collection and Preprocessing
  3. Exploratory Data Analysis (EDA)
  4. Model Development and Evaluation
  5. Conclusion

1. Introduction¶

Platforms such as Yahoo Finance and related financial apps have altered the way that stock market data is accessed in the current investment world. They give us access to a multitude of information, including complex financial indicators and real-time stock prices. The difficulty in sorting through the vast amount of data is figuring out which facts are essential for wise investment choices.

The stock market is a maze of factors that can be confusing for both amateurs and experienced traders. Complex elements such as corporate earnings, market sentiment, economic indicators, and world events all affect how much stock prices move. Successful stock trading requires an understanding of the interactions between these numerous variables and the identification of the most significant ones.

My project sets off on a quest to solve these puzzles. We explore the core of stock market data with the goal of elucidating the variety of features and indications. Using cutting-edge data analysis methods and machine learning models, we aim to address the following crucial questions:

What are the most crucial features that influence stock prices?

How do these features interact with each other, and what are their relationships?

Can we predict future stock prices or do the recommendation based on the crucial features?

The Dataset We Are Using¶

I find three datasets realted to this project. One dataset is 50 tech company Stock History about stock prices and volumes, two are about Earning Report: one is from Macrotrends(https://www.macrotrends.net/) and the other is from Yahoo Finance(https://finance.yahoo.com/). The time period of all the dataset is 2010 to 2023.

Stock History¶

I got my hands on the historical data of these companies as TABLE1, including their stock prices and volumes over the years, using a library called yfinance(https://pypi.org/project/yfinance/). TABLE1 is the main dataset we need to focus on, it will describe how the stock price move.

Earnings Reports¶

We’re also looking at earnings data from Macrotrends(TABLE2) and Yahoo Finance(TABLE3) to get an idea of how these companies are doing money-wise, and if they're growing or not. From Macrotrends, I’ve got:

  • Date: When the earnings report was released.
  • Revenue (in millions): Tells us how much money the company raked in.

and we will think about the question: Is Revenue the most important feature to decide the how the stock moves? What is the relationship between this feature and other features?

From Yahoo Finance, I’ve got the earnings calendar that includes:

  • Earnings Date: the date when the earnings were announced.
  • EPS Estimate: The predicted Earnings Per Share (EPS) - basically, how much profit the company is expected to make for each share.
  • Reported EPS(EPS Actual): The actual EPS.
  • Surprise(%): measures the difference between actual earnings and analyst estimates.

and we will think about the question: Is Reported EPS the most important feature to decide the how the stock moves? Compared with other features, Is this a better determinant of stock movement?

Trend Indicators¶

I’m adding in some trend indicators like MACD, RSI, and Bollinger Bands to help us spot patterns and trends in the data more easily by using pandas_ta(https://github.com/twopirllc/pandas-ta) library. We tidy TABLE1 and can get these indicators, and we will consider questions: Which one of MACD, RSI, Bollinger Bands is the important feature to help us decide the the how to operate stocks strategies? What is the relationship between one feature and other features?

How we’re working together¶

Only me do the project. I use Google Drive and Google Colab as my development platform.

2. Data Collection and Preprocessing¶

Import all the Necessary Libraries¶

We use yfinance(https://pypi.org/project/yfinance/) to get the stock daily prices data and use pandas_ta(https://github.com/twopirllc/pandas-ta) to get indicators to be used for future analysis

In [ ]:
!pip install pandas_ta
!pip install python-dateutil
import yfinance as yf
import pandas as pd
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup
import pandas_ta as ta
from dateutil import parser
import time
import seaborn as sns
import warnings
import numpy as np
warnings.filterwarnings('ignore')
Requirement already satisfied: pandas_ta in /usr/local/lib/python3.10/dist-packages (0.3.14b0)
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from pandas_ta) (1.5.3)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas->pandas_ta) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->pandas_ta) (2023.3.post1)
Requirement already satisfied: numpy>=1.21.0 in /usr/local/lib/python3.10/dist-packages (from pandas->pandas_ta) (1.23.5)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas->pandas_ta) (1.16.0)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.10/dist-packages (2.8.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil) (1.16.0)

TABLE1: Get the original dataset from Yahoo Finance for one stock APPLE¶

Columns Open, High, Low, Close, Adj Close are about stock price. The volume is the number of shares that were traded during a given day.

In [ ]:
# Load data from Yahoo Finance
ticker = "AAPL"
stock_data = yf.download(ticker, start="2020-01-01", end="2023-01-01")

# Display the data
display(stock_data.head())

print(stock_data.shape)
display(stock_data.dtypes)
[*********************100%%**********************]  1 of 1 completed
Open High Low Close Adj Close Volume
Date
2020-01-02 74.059998 75.150002 73.797501 75.087502 73.152649 135480400
2020-01-03 74.287498 75.144997 74.125000 74.357498 72.441460 146322800
2020-01-06 73.447502 74.989998 73.187500 74.949997 73.018684 118387200
2020-01-07 74.959999 75.224998 74.370003 74.597504 72.675278 108872000
2020-01-08 74.290001 76.110001 74.290001 75.797501 73.844353 132079200
(756, 6)
Open         float64
High         float64
Low          float64
Close        float64
Adj Close    float64
Volume         int64
dtype: object

Above dataset is the example of one stock, now we extend it to 50 stocks

Specifying the Stock Tickers¶

We have chosen 50 tech company tickers for our analysis.

In [ ]:
tickers = [
    "AAPL", "MSFT", "AMZN", "GOOGL", "META", "TSLA", "NVDA", "PYPL", "INTC", "CSCO",
    "ADBE", "CRM", "ACN", "TXN", "ORCL", "QCOM", "IBM", "AVGO", "AMD", "SAP",
    "SHOP", "NOW", "SQ", "INTU", "FIS", "VMW", "HPE", "HPQ", "MU", "AMAT",
    "LRCX", "KLAC", "ADI", "MCHP", "CDNS", "ANSS", "KEYS", "TEL", "WDAY", "CTSH",
    "DOCU", "SNPS", "FTNT", "GLW", "VRSN", "AKAM", "SWKS", "TTD", "EPAM", "NET"
]

TABLE1 Tidy: Add New Indicators¶

MACD (Moving Average Convergence Divergence): MACD is a trend-following momentum indicator that shows the relationship between two moving averages of a stock’s price.

MACD Signal: The MACD signal line is a 9-day EMA of the MACD. When the MACD crosses above the signal line, it gives a bullish (buy) signal, and when it crosses below, it gives a bearish (sell) signal.

MACD Histogram: The MACD histogram is the difference between the MACD and its signal line. A positive histogram suggests that bulls are in control (buying pressure), while a negative histogram suggests bears have control (selling pressure).

RSI (Relative Strength Index): RSI measures the magnitude of recent price changes to evaluate overbought or oversold conditions in the price of a stock. Readings above 70 indicate that a stock may be overbought, and readings below 30 indicate that the stock may be oversold.

Bollinger Upper Band: The upper Bollinger Band is two standard deviations above the simple moving average (SMA) of a stock's price. It helps identify when a stock’s price is considered "high."

Bollinger Middle Band: The middle Bollinger Band is the SMA of the stock’s price, often serving as support or resistance levels for the stock price.

Bollinger Lower Band: The lower Bollinger Band is two standard deviations below the SMA of the stock’s price. It helps identify when a stock’s price is considered "low."

In [ ]:
def fetch_and_enhance_stock_data(tickers):
    all_data = []

    for ticker in tickers:
        try:
            # Fetch historical stock data
            stock_data = yf.download(ticker, start="2010-01-01", end="2023-01-01")

            # Calculate MACD
            macd = ta.macd(stock_data['Close'])
            stock_data['MACD'] = macd.iloc[:, 0]
            stock_data['MACD Signal'] = macd.iloc[:, 1]
            stock_data['MACD Histogram'] = macd.iloc[:, 2]

            # Calculate RSI
            stock_data['RSI'] = ta.rsi(stock_data['Close'])

            # Calculate Bollinger Bands
            bbands = ta.bbands(stock_data['Close'])
            stock_data['Bollinger Lower'] = bbands.iloc[:, 0]
            stock_data['Bollinger Middle'] = bbands.iloc[:, 1]
            stock_data['Bollinger Upper'] = bbands.iloc[:, 2]

            stock_data['Stock'] = ticker
            cols = ['Stock'] + [col for col in stock_data if col != 'Stock']
            stock_data = stock_data[cols]
            # Drop nan
            all_data.append(stock_data.dropna())
        except Exception as e:
            print(f"Could not retrieve data for {ticker}: {str(e)}")
            continue

    # Concatenate all the individual DataFrames into one
    combined_data = pd.concat(all_data, axis=0)

    return combined_data


# Fetch and enhance the stock data
stock_data = fetch_and_enhance_stock_data(tickers)

table1 = stock_data.copy()

display(stock_data.head(10))
print(stock_data.shape)
display(stock_data.dtypes)
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
Stock Open High Low Close Adj Close Volume MACD MACD Signal MACD Histogram RSI Bollinger Lower Bollinger Middle Bollinger Upper
Date
2010-02-22 AAPL 7.226429 7.232143 7.113929 7.157857 6.067570 390563600 -0.088343 0.066146 -0.154489 47.617442 7.145997 7.221214 7.296432
2010-02-23 AAPL 7.142857 7.190357 6.989643 7.037857 5.965849 575094800 -0.091067 0.050738 -0.141805 43.102673 7.024712 7.175929 7.327145
2010-02-24 AAPL 7.079643 7.194286 7.065714 7.166429 6.074836 460566400 -0.081906 0.047918 -0.129825 48.713452 7.022721 7.162429 7.302136
2010-02-25 AAPL 7.049286 7.245000 7.031786 7.214286 6.115404 665126000 -0.069979 0.047877 -0.117856 50.663657 7.030473 7.155786 7.281099
2010-02-26 AAPL 7.227857 7.327500 7.214286 7.307857 6.194722 507460800 -0.052372 0.052387 -0.104759 54.321085 7.001694 7.176857 7.352020
2010-03-01 AAPL 7.348214 7.482143 7.337500 7.463929 6.327023 550093600 -0.025530 0.063383 -0.088913 59.688942 6.952883 7.238072 7.523261
2010-03-02 AAPL 7.497500 7.529643 7.419286 7.458929 6.322784 566546400 -0.004608 0.067444 -0.072052 59.447917 7.077504 7.322286 7.567068
2010-03-03 AAPL 7.462143 7.495357 7.426429 7.476071 6.337313 372052800 0.013204 0.068205 -0.055001 60.043600 7.174299 7.384214 7.594130
2010-03-04 AAPL 7.474286 7.532857 7.451071 7.525357 6.379092 366041200 0.030940 0.068753 -0.037813 61.781887 7.300079 7.446429 7.592779
2010-03-05 AAPL 7.676429 7.846429 7.665357 7.819643 6.628554 899620400 0.067959 0.084617 -0.016658 70.136225 7.273868 7.548786 7.823704
(146264, 14)
Stock                object
Open                float64
High                float64
Low                 float64
Close               float64
Adj Close           float64
Volume                int64
MACD                float64
MACD Signal         float64
MACD Histogram      float64
RSI                 float64
Bollinger Lower     float64
Bollinger Middle    float64
Bollinger Upper     float64
dtype: object

TABLE2: Function to Fetch Earnings Data from Macrotrends and TABLE2 Tidy: Clean data and combine all 50 stocks¶

In Milestone 1 and Milestone 2 the function to scrape earnings data from Macrotrends(https://www.macrotrends.net/) for each stock can work perfectly. For each ticker, we will use url = 'https://www.macrotrends.net/stocks/charts/{ticker.lower()}/{ticker}/revenue' to scrape data by replacing ticker with each element of tickers. For example, Apple Quarterly Revenue data is in https://www.macrotrends.net/stocks/charts/aapl/aapl/revenue. But in the final, it seems that Macrotrends blocks the scraping. So I used the previous dataset kept in the Milestone 2 and save it in the https://docs.google.com/spreadsheets/d/e/2PACX-1vQrSyFpQ9sFL4_K5ijoQ2xLjljVo_4VHTOo2bVms2jyDCXNHCjbwUPcotcmzvm3bw/pub?output=xlsx. We can use Apple Quarterly Revenue as a long-term investment analysis.

In [ ]:
import pandas as pd
import requests
from io import BytesIO

google_sheet_url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vQrSyFpQ9sFL4_K5ijoQ2xLjljVo_4VHTOo2bVms2jyDCXNHCjbwUPcotcmzvm3bw/pub?output=xlsx'

# Make the request to the URL
response = requests.get(google_sheet_url)

# If the request was successful, read the data into a pandas DataFrame
if response.status_code == 200:
    # Read the data into a pandas DataFrame
    data = pd.read_excel(BytesIO(response.content))


    # Initialize an empty DataFrame to hold the cleaned data
    cleaned_data = pd.DataFrame(columns=['Date', 'Revenue', 'Stock'])

    # Variable to track the current ticker as we iterate through the data
    current_ticker_index = 0

    # Go through the DataFrame, look for the tickers and corresponding data
    for index, row in data.iterrows():
        # Identify company name header and update the ticker
        if isinstance(row[0], str) and 'Quarterly Revenue' in row[0]:
            # Get the current ticker based on the index
            current_ticker = tickers[current_ticker_index]
            # Increase the index for the next company
            current_ticker_index += 1
            continue

        # Skip header rows
        if '(Millions of US $)' in str(row[0]):
            continue


        # Process and append the data rows
        if pd.notna(row[0]) and pd.notna(row[1]):
            date = pd.to_datetime(row[0]).date()
            # Check if revenue is a string and needs cleaning
            if isinstance(row[1], str):
                revenue = pd.to_numeric(row[1].replace('$', '').replace(',', ''))
            else:
                # Revenue is already a number, no need for cleaning
                revenue = row[1]
            cleaned_data = cleaned_data.append({
                'Date': date,
                'Revenue': revenue,
                'Stock': tickers[current_ticker_index]
            }, ignore_index=True)
    # Convert the 'Date' column to datetime64[ns] format if it's not already
    cleaned_data['Date'] = pd.to_datetime(cleaned_data['Date'])
else:
    print("Failed to fetch the file.")


table2 = cleaned_data.copy()
display(table2.head(100))
print(table2.shape)
display(table2.dtypes)
Date Revenue Stock
0 2023-09-30 89498.0 AAPL
1 2023-06-30 81797.0 AAPL
2 2023-03-31 94836.0 AAPL
3 2022-12-31 117154.0 AAPL
4 2022-09-30 90146.0 AAPL
... ... ... ...
95 2014-09-30 23201.0 MSFT
96 2014-06-30 23382.0 MSFT
97 2014-03-31 20403.0 MSFT
98 2013-12-31 24519.0 MSFT
99 2013-09-30 18529.0 MSFT

100 rows × 3 columns

(2717, 3)
Date       datetime64[ns]
Revenue           float64
Stock              object
dtype: object

TABLE3: Fetching Earnings Data from Yahoo Finance¶

A function to scrape earnings data from Yahoo Finance(https://finance.yahoo.com/) for each stock. We can use Reported EPS as a long-term investment decision.

In [ ]:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
def fetch_yahoo_earnings_data(ticker):
    url = f'https://finance.yahoo.com/calendar/earnings?symbol={ticker}'
    response = requests.get(url, headers=headers)

    if response.status_code != 200:
        print(f"Failed to retrieve the data for {ticker}")
        return None

    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find('table')
    earnings_data = pd.read_html(str(table))[0]

    return earnings_data

msft_yahoo_earnings_data = fetch_yahoo_earnings_data('MSFT')
display(msft_yahoo_earnings_data.head(10))
print(msft_yahoo_earnings_data.shape)
display(msft_yahoo_earnings_data.dtypes)
Symbol Company Earnings Date EPS Estimate Reported EPS Surprise(%)
0 MSFT Microsoft Corp Oct 22, 2024, 6 AMEDT - - -
1 MSFT Microsoft Corp Jul 23, 2024, 6 AMEDT - - -
2 MSFT Microsoft Corp Apr 23, 2024, 6 AMEDT - - -
3 MSFT Microsoft Corp Jan 22, 2024, 4 PMEST 2.78 - -
4 MSFT Microsoft Corp Jan 22, 2024, 5 AMEST 2.78 - -
5 MSFT Microsoft Corporation Oct 24, 2023, 12 PMEDT 2.65 2.99 +12.7
6 MSFT Microsoft Corporation Jul 25, 2023, 12 PMEDT 2.55 2.69 +5.49
7 MSFT Microsoft Corporation Apr 25, 2023, 12 PMEDT 2.23 2.45 +9.81
8 MSFT Microsoft Corporation Jan 24, 2023, 11 AMEST 2.29 2.32 +1.09
9 MSFT Microsoft Corporation Oct 25, 2022, 12 PMEDT 2.3 2.35 +2.05
(100, 6)
Symbol           object
Company          object
Earnings Date    object
EPS Estimate     object
Reported EPS     object
Surprise(%)      object
dtype: object

TABLE3 Tidy: Clean above table¶

In [ ]:
def parse_date(date_string):
    try:
        return parser.parse(date_string)
    except Exception as e:
        return None



def clean_yahoo_earnings_data(df, ticker):
    cleaned_data = []
    for index, row in df.iterrows():
        try:
            if row['EPS Estimate'] == '-' or row['Reported EPS'] == '-' or row['Surprise(%)'] == '-':
                continue

            date_string = row['Earnings Date']

            earnings_date = parse_date(date_string)
            eps_estimate = float(row['EPS Estimate']) if row['EPS Estimate'] != '-' else None
            reported_eps = float(row['Reported EPS']) if row['Reported EPS'] != '-' else None
            surprise = float(row['Surprise(%)'].replace('%', '')) if row['Surprise(%)'] != '-' else None

            cleaned_data.append([ticker, earnings_date, eps_estimate, reported_eps, surprise])
        except Exception as e:
            continue


    cleaned_df = pd.DataFrame(cleaned_data, columns=['Stock', 'Earnings Date', 'EPS Estimate', 'Reported EPS', 'Surprise(%)'])
    return cleaned_df

TABLE3 Tidy: Combine all 50 stocks¶

In [ ]:
all_yahoo_earnings_data = []

for ticker in tickers:
    raw_yahoo_earnings_data = fetch_yahoo_earnings_data(ticker)

    if raw_yahoo_earnings_data is not None and not raw_yahoo_earnings_data.empty:
        cleaned_yahoo_earnings_data = clean_yahoo_earnings_data(raw_yahoo_earnings_data, ticker)

        if not cleaned_yahoo_earnings_data.empty:
            all_yahoo_earnings_data.append(cleaned_yahoo_earnings_data)
        time.sleep(1)

if all_yahoo_earnings_data:
    combined_yahoo_earnings_data = pd.concat(all_yahoo_earnings_data, ignore_index=True)
else:
    print("No data to display.")
In [ ]:
table3 = combined_yahoo_earnings_data.copy()

display(combined_yahoo_earnings_data.head(10))
print(combined_yahoo_earnings_data.shape)
display(combined_yahoo_earnings_data.dtypes)
Stock Earnings Date EPS Estimate Reported EPS Surprise(%)
0 AAPL 2023-11-02 12:00:00 1.39 1.46 4.92
1 AAPL 2023-08-03 12:00:00 1.19 1.26 5.49
2 AAPL 2023-05-04 12:00:00 1.43 1.52 6.03
3 AAPL 2023-02-02 11:00:00 1.94 1.88 -2.88
4 AAPL 2022-10-27 12:00:00 1.27 1.29 1.55
5 AAPL 2022-07-28 12:00:00 1.16 1.20 3.25
6 AAPL 2022-04-28 12:00:00 1.43 1.52 6.44
7 AAPL 2022-01-27 11:00:00 1.89 2.10 11.17
8 AAPL 2021-10-28 12:00:00 1.24 1.24 0.30
9 AAPL 2021-07-27 12:00:00 1.01 1.30 29.12
(3534, 5)
Stock                    object
Earnings Date    datetime64[ns]
EPS Estimate            float64
Reported EPS            float64
Surprise(%)             float64
dtype: object

Merge three Tables¶

I combine all three datasets (Table1: stock_data, Table2: combined_earnings_data, and Table3: combined_yahoo_earnings_data) and conduct a deep EDA to uncover relationships between important features. This will inform the development of a future stock recommendation model, whether it's based on Reinforcement Learning (RL) or classic prediction models.

First, we'll merge the datasets on common columns (Date and Stock). Since Table1 and Table3 have a one-to-one relationship on the date of earnings release, we can merge them directly. The Table2 will be joined based on the nearest date since earnings data is quarterly and stock data is daily.

In [ ]:
table1_final = table1.copy()
table2_final = table2.copy()
table3_final = table3.copy()
In [ ]:
# Reset index for table1_final
table1_final.reset_index(inplace=True)

table1_final['Date'] = pd.to_datetime(table1_final['Date']).dt.date


# Convert 'Earnings Date' to datetime format
table3_final['Earnings Date'] = pd.to_datetime(table3_final['Earnings Date'])

# Extract just the date part from 'Earnings Date'
table3_final['Earnings Date'] = table3_final['Earnings Date'].dt.date

table3_final.rename(columns={'Earnings Date': 'Date'}, inplace=True)

Final Dataset¶

Here, after merging, the columns of 'EPS Estimate', 'Reported EPS', 'Surprise(%)', 'Revenue' will have Nan value, since EPS is reported quarterly but stock data is daily. We choose to drop Nan values because firstly we don't know the true variant EPS between different quarters and after droping Nan values we still have enough data.

In [ ]:
# Merge table1_final with table3_final on the Date
table1_final['Date'] = pd.to_datetime(table1_final['Date'])
table3_final['Date'] = pd.to_datetime(table3_final['Date'])

merged_data = pd.merge(table1_final, table3_final, how='left', left_on=['Date', 'Stock'], right_on=['Date', 'Stock'])


# Merge merged_data with table2_final on the nearest Date
table2_final['Date'] = pd.to_datetime(table2_final['Date'])
merged_data['Date'] = pd.to_datetime(merged_data['Date'])
merged_data = pd.merge_asof(merged_data.sort_values('Date'), table2_final.sort_values('Date'), by='Stock', on='Date', direction='nearest')


# Move 'Stock' column to the first position
cols = ['Stock'] + [col for col in merged_data.columns if col != 'Stock']
merged_data = merged_data[cols]

# Drop Nan
merged_data.dropna(subset=['EPS Estimate', 'Reported EPS', 'Surprise(%)', 'Revenue'], inplace=True)

merged_data.reset_index(drop=True, inplace=True)

# Display merged dataset
display(merged_data.head())

print(merged_data.shape)
display(merged_data.dtypes)
Stock Date Open High Low Close Adj Close Volume MACD MACD Signal MACD Histogram RSI Bollinger Lower Bollinger Middle Bollinger Upper EPS Estimate Reported EPS Surprise(%) Revenue
0 CRM 2010-02-24 17.080000 17.504999 17.022499 17.360001 17.360001 10460400 -0.075746 0.271555 -0.347302 59.118764 16.990480 17.204500 17.418519 0.04 0.04 6.89 354.0
1 AVGO 2010-02-24 17.500000 17.870001 17.410000 17.740000 12.838194 852300 -0.254461 0.146585 -0.401046 52.805132 17.210859 17.490000 17.769142 0.30 0.34 14.09 456.0
2 ANSS 2010-02-25 43.099998 43.639999 42.029999 43.619999 43.619999 462800 0.160317 0.380058 -0.219741 58.903683 42.637163 43.400000 44.162837 0.49 0.53 8.38 136.0
3 ADBE 2010-03-23 35.259998 35.419998 34.689999 35.220001 35.220001 8504500 0.240266 0.015448 0.224818 55.163065 34.452261 35.216000 35.979740 0.37 0.40 7.34 859.0
4 ACN 2010-03-25 42.099998 42.480000 41.500000 41.520000 32.030975 4643600 0.221055 0.011567 0.209488 48.319988 41.338574 42.040001 42.741428 0.61 0.60 -2.17 5538.0
(1918, 19)
Stock                       object
Date                datetime64[ns]
Open                       float64
High                       float64
Low                        float64
Close                      float64
Adj Close                  float64
Volume                       int64
MACD                       float64
MACD Signal                float64
MACD Histogram             float64
RSI                        float64
Bollinger Lower            float64
Bollinger Middle           float64
Bollinger Upper            float64
EPS Estimate               float64
Reported EPS               float64
Surprise(%)                float64
Revenue                    float64
dtype: object

3. Exploratory Data Analysis (EDA)¶

Summary Statistics:¶

Descriptive statistics for features.

In [ ]:
# Summary Statistics
print(merged_data.describe())
print(merged_data['Stock'].value_counts())
              Open         High          Low        Close    Adj Close  \
count  1918.000000  1918.000000  1918.000000  1918.000000  1918.000000   
mean     86.144694    87.407079    84.804475    86.164718    80.923545   
std      95.245227    96.730245    93.650931    95.220180    93.497167   
min       1.534000     1.544667     1.504000     1.522667     1.522667   
25%      26.885000    27.342500    26.400000    26.857500    23.057625   
50%      53.000000    53.720001    52.230333    53.049999    48.358913   
75%     108.297503   109.667500   106.392498   108.112497   101.733431   
max     675.700012   684.830017   663.609985   664.760010   664.760010   

             Volume         MACD  MACD Signal  MACD Histogram          RSI  \
count  1.918000e+03  1918.000000  1918.000000     1918.000000  1918.000000   
mean   3.714428e+07     0.441043     0.026970        0.414074    54.443669   
std    9.354634e+07     3.122555     0.918362        2.937325    11.591189   
min    8.830000e+04   -35.798676   -12.289617      -30.238577    12.441018   
25%    3.341182e+06    -0.177114    -0.099613       -0.188826    46.566632   
50%    1.001406e+07     0.253658     0.025031        0.250846    55.006830   
75%    3.702968e+07     0.988734     0.204714        0.896936    62.901392   
max    1.880998e+09    20.520009     6.946169       22.839408    83.747668   

       Bollinger Lower  Bollinger Middle  Bollinger Upper  EPS Estimate  \
count      1918.000000       1918.000000      1918.000000   1918.000000   
mean         83.287369         86.054272        88.821174      0.897482   
std          92.146079         95.436639        98.800248      1.132863   
min           1.449265          1.541333         1.561554     -0.500000   
25%          26.056246         27.019000        27.690315      0.300000   
50%          51.333744         52.679000        54.197460      0.595000   
75%         105.496420        108.981002       113.085280      1.050000   
max         662.692464        680.662012       698.631559     10.280000   

       Reported EPS  Surprise(%)        Revenue  
count   1918.000000  1918.000000    1918.000000  
mean       0.959614    13.721658    8114.944734  
std        1.179345    73.840355   15506.738689  
min       -0.870000 -1616.790000  -26397.000000  
25%        0.320000     2.000000     720.250000  
50%        0.630000     6.075000    2145.500000  
75%        1.157500    13.647500    8440.000000  
max       10.450000  1283.470000  137412.000000  
ORCL     52
MU       52
CRM      51
MSFT     51
CSCO     51
AMAT     51
INTC     51
AKAM     51
GOOGL    51
NVDA     51
AAPL     51
TXN      51
LRCX     51
QCOM     51
AVGO     50
KLAC     50
HPQ      50
IBM      50
FTNT     49
ADBE     49
CDNS     49
INTU     49
VMW      49
SWKS     48
SNPS     48
VRSN     47
AMD      46
AMZN     44
MCHP     44
TSLA     42
META     42
ANSS     40
WDAY     39
NOW      39
CTSH     33
KEYS     32
PYPL     28
HPE      28
SQ       26
ADI      22
ACN      18
FIS      18
GLW      18
TTD      18
DOCU     17
EPAM     12
NET       8
Name: Stock, dtype: int64

Summary Statistics Analysis:¶

Stock Price (Open, High, Low, Close, Adj Close): The count is consistent across these variables, indicating no missing values. The mean, median (50%), and standard deviation (std) provide a sense of the central tendency and spread of the data. We can find the average closing price (mean) is around 86, but std is around 95, suggesting that stock prices across different stocks vary widely.

Volume: The trading volume also varies greatly (std is much higher than the mean), and there's a huge range (min to max), indicating that some stocks are traded much more frequently than others.

Technical Indicators (MACD, RSI, Bollinger Bands): These are typical indicators used in stock price analysis. The RSI, for example, ranges from 12.44 to 83.74, with a mean of 54.46, suggesting that on average, stocks are neither overbought nor oversold.

EPS and Revenue: There's a lot of variability in EPS (Earnings Per Share) and revenue as well. The negative minimum values for Reported EPS and Revenue indicate losses for some quarters.

Surprise(%): This shows how much the actual earnings differed from the estimates. The huge range (from -1616.79% to 1283.47%) indicates some significant surprises in earnings reports.

For counting number, some stocks like 'NET' and 'EPAM' have fewer data points, which could be due to them being listed on the stock exchange more recently compared to others.

So base on the above analysis, the next steps in the EDA process will focus on visualizing and understanding the relationships and patterns in the data.

Top 10 Stocks by Variance:¶

From above statistics analysis, we know variance is really large. The figure below shows the top 10 stocks by variance.

In [ ]:
import numpy as np

# Calculate variance for each stock
stock_variance = merged_data.groupby('Stock')['Close'].var()

# Sort stocks by variance in descending order and pick top 10
top_10_stocks = stock_variance.sort_values(ascending=False).head(10).index

plt.figure(figsize=(15, 6))

# Plot only for top 10 stocks with highest variance
for stock in top_10_stocks:
    subset = merged_data[merged_data['Stock'] == stock]
    plt.plot(subset['Date'], subset['Close'], label=stock)

plt.xlabel('Date')
plt.ylabel('Close Price')
plt.title('Stock Prices Over Time for Top 10 Stocks by Variance')
plt.legend(loc='upper left', bbox_to_anchor=(1, 1))
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

From above results, we can find even the variance of excellent companies such as TSLA and META is also very large. In addition, it is not difficult to see that after 2022, some stocks will have a big decline while others like AVGO will have a small decline, which reflects the importance of stock trading strategy and the necessity of stock recommendation.

Distribution of Stock Prices¶

Visualize the distribution of stock prices to understand their range and common values.

In [ ]:
plt.figure(figsize=(10, 6))
sns.histplot(data=merged_data, x='Close', kde=True)
plt.title('Distribution of Closing Prices')
plt.show()

From above results, we can know that the right-skewed distribution of stock prices in our dataset suggests that most stocks are clustered at lower prices, with a few outliers exhibiting significantly higher prices. The concentration of stocks at lower price points may represent more accessible investment opportunities for a broader range of investors, while the higher-priced outliers might be suitable for investors with a larger capital base. We will do the Volume Analysis to verify our findings.

Volume Traded vs. Stock Price¶

Explore the relationship between the volume traded and the stock price, as high trading volumes can often lead to significant price movements.

In [ ]:
plt.figure(figsize=(10, 6))

sns.regplot(x='Volume', y='Close', data=merged_data, scatter_kws={'alpha':0.3}, lowess=True)

# Setting the x-axis to a logarithmic scale for better visualization
plt.xscale('log')

plt.title('Volume Traded vs. Closing Price')
plt.xlabel('Volume Traded')
plt.ylabel('Closing Price')
plt.show()

The plot indicates a concentration of data points towards the lower end of the y-axis (Closing Price) and a widespread along the x-axis (Volume Traded). This pattern suggests that stocks with lower closing prices tend to have a broader range of trading volumes, while those with higher closing prices have relatively less variation in their trading volumes. From above analysis, we consider the volumes and history of stock price as the import features to do the future recommendation. The stock with lower price with high volumes tends to trade more frequently.

MACD vs. Stock Price and RSI vs. Stock Price¶

Explore the relationship between the MACD or RSI and the stock price, as they are import indicators can often indicate significant price movements.

In [ ]:
# Plotting MACD vs. Stock Price
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
sns.regplot(x='MACD', y='Close', data=merged_data, lowess=True, scatter_kws={'alpha':0.3})
plt.title('MACD vs. Closing Price')
plt.xlabel('MACD')
plt.ylabel('Closing Price')

# Plotting RSI vs. Stock Price
plt.subplot(1, 2, 2)
sns.regplot(x='RSI', y='Close', data=merged_data, lowess=True, scatter_kws={'alpha':0.3})
plt.title('RSI vs. Closing Price')
plt.xlabel('RSI')
plt.ylabel('Closing Price')

plt.tight_layout()
plt.show()

From above plots, the MACD (Moving Average Convergence Divergence) plot shows a upside down pointy end pattern centered around the zero line. When MACD crosses above zero, it often signals an increasing positive momentum (bullish trend), and stocks may see rising prices. This information can be crucial for a stock recommendation system.

RSI Analysis: The RSI (Relative Strength Index) plot appears as a flat, positive line, suggesting that there isn't a strong, clear relationship between RSI values and stock prices. This might indicate that RSI alone may not be a sufficient predictor for stock price movement in this dataset. RSI may need to be combined with other indicators or used in specific contexts to provide meaningful insights for stock recommendations. In the feature engineering, I will use two methods to do future tasks, ignoring RSI and combining it with other features to see the difference.

Reported EPS vs. Stock Price, Surprise(%) vs. Stock Price, Revenue vs.Stock Price¶

Explore the relationship between the Reported EPS or Surprise(%) or Revenue and the stock price.

In [ ]:
# Plotting Reported EPS vs. Stock Price
plt.figure(figsize=(18, 6))
plt.subplot(1, 3, 1)
sns.scatterplot(x='Reported EPS', y='Close', data=merged_data, alpha=0.3)
sns.regplot(x='Reported EPS', y='Close', data=merged_data, lowess=True, scatter=False, color='red')
plt.title('Reported EPS vs. Closing Price')
plt.xlabel('Reported EPS')
plt.ylabel('Closing Price')

# Plotting Surprise(%) vs. Stock Price
plt.subplot(1, 3, 2)
sns.scatterplot(x='Surprise(%)', y='Close', data=merged_data, alpha=0.3)
sns.regplot(x='Surprise(%)', y='Close', data=merged_data, lowess=True, scatter=False, color='red')
plt.title('Surprise(%) vs. Closing Price')
plt.xlabel('Surprise(%)')
plt.ylabel('Closing Price')

# Plotting Revenue vs. Stock Price
plt.subplot(1, 3, 3)
sns.scatterplot(x='Revenue', y='Close', data=merged_data, alpha=0.3)
sns.regplot(x='Revenue', y='Close', data=merged_data, lowess=True, scatter=False, color='red')
plt.title('Revenue vs. Closing Price')
plt.xlabel('Revenue')
plt.ylabel('Closing Price')

plt.tight_layout()
plt.show()

From above plots we can know Reported EPS vs. Stock Price: The plot for Reported EPS (Earnings Per Share) versus Stock Price reveals a strong positive relationship. This indicates that stocks with higher earnings per share tend to have higher stock prices. This is consistent with fundamental analysis principles, as EPS is a direct measure of a company's profitability on a per-share basis. A higher EPS typically signals better financial health and thus a potentially more valuable investment.

Although this feature is the strongest correlation, we still see many points deviating from the correlation line, which proves that the stock price is not only affected by Reported EPS, but also necessary to analyze other features.

Surprise(%) vs. Stock Price: The relationship between Surprise(%) and Stock Price is gently positive. Surprise(%) measures the difference between actual earnings and analyst estimates. A positive surprise indicates that the company performed better than expected, which can lead to a positive stock price reaction. However, the gentle slope suggests that while earnings surprises can influence stock prices, other factors are also at play, and the impact may not be as pronounced or consistent.

Revenue vs. Stock Price: The plot for Revenue versus Stock Price also shows a gently positive trend. This indicates that companies with higher revenues tend to have higher stock prices, but the relationship is not as strong as with EPS. This makes sense because while revenue is an essential indicator of company size and market presence, it does not account for costs and expenses. Therefore, it's not as strong a profitability indicator as EPS.

Scatter Matrix and Correlation Matrix¶

Explore the relationship between different features and conclude the final feature table.

In [ ]:
features_subset = merged_data[['Open', 'High', 'Low', 'Close', 'Volume', 'MACD', 'RSI', 'Bollinger Upper', 'Bollinger Lower', 'EPS Estimate', 'Reported EPS', 'Surprise(%)', 'Revenue']
]

# Pair plot
sns.pairplot(features_subset)
plt.show()
In [ ]:
# Selecting relevant financial indicators
financial_indicators = ['Open', 'High', 'Low', 'Close', 'Volume', 'MACD', 'RSI', 'Bollinger Upper', 'Bollinger Lower', 'EPS Estimate', 'Reported EPS', 'Surprise(%)', 'Revenue']

# Creating the correlation matrix
corr_matrix = merged_data[financial_indicators].corr()

# Plotting the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Financial Indicators')
plt.show()

From above results(Scatter Matrix and Correlation Matrix), multiple features (such as Open, High, Low, Close, Bollinger Upper, and Bollinger Lower) show a perfectly linear relationship with a slope of 1, it implies that they are highly correlated, and changes in one feature are mirrored exactly by the others.

This means that these features carry very similar information for predictive modeling purposes. If they are perfectly or almost perfectly correlated, I will consider using just one of these features for future predictions to avoid redundancy.

I choose to only include the 'Close' price as prediction in my model in the feature(Open, High, Low, Close, Bollinger Upper, and Bollinger Lower), since it encapsulates the information provided by the other variables.

In addition, from previous analysis and scatter matrix, I select Volume, MACD, Reported EPS, Surprise(%), Revenue and stock history price as feature table used for future prediction. Because they are all related to stock close price especially Reported EPS with strong relationship to stock close price. RSI is the feature that we are not sure, so I decide to compare results with feature table including RSI and without RSI. From Correlation Matrix, I also find that Surprise(%), Revenue are weakly related to stock close price. So in the feature engineering, I will also compare the results with them or without them.

4. Model Development and Evaluation¶

Based on my previous analyses and feature selection, I can frame several distinct model questions that leverage different modeling approaches:

Model 1: Predicting Stock Price Movements with Supervised Learning¶

I utilize historical stock data and financial indicators to predict future closing prices of stocks. Our independent variables will include 'Volume', 'MACD', 'Reported EPS', 'Surprise(%)', 'Revenue', and historical stock prices. I will experiment with including and excluding 'RSI' to assess its impact on predictive performance. Our dependent variable will be the 'Close' price of the stock on trading for the next quarter. I employ a regression-based supervised learning model, a Random Forest Regressor, to capture the non-linear relationships observed in our exploratory data analysis. Our EDA supports this approach, as it revealed strong correlations between these features and the closing price. I split our data into training and testing sets by 5 fold cross validation to evaluate the model's performance and ensure it generalizes well to unseen data.

Model 2: Optimizing Stock Trading Decisions with Contextual Bandits¶

I deploy a Contextual Bandits model to optimize trading decisions in the stock market. State, Actions and Reward is defined as following: State (Context): The state consists of stock features of 'Close' price, 'Volume', 'MACD', 'Reported EPS', 'Surprise(%)', and 'Revenue'.

Actions: The actions is "buy", "sell", or "hold".

Reward: The reward function is the immediate profit or loss from an action. For example, if the agent decides to buy, and the stock price increases in the next time step, the reward is positive. Conversely, if the price decreases, the reward is negative.

Our EDA supports this approach, as it revealed strong correlations between these features and the closing price. I split our data into training and testing sets. I train the model using training data and evaluate its performance using a testing data.

These models will enable us to address different aspects of stock market prediction and trading optimization. The supervised learning model will focus on predicting future prices, while the RL model will aim at making profitable trading decisions.

Model1¶

Random Forest Regressor¶

The Random Forest Regressor is a powerful ensemble learning method that operates by constructing a multitude of decision trees at training time and outputting the mean prediction of the individual trees. It's particularly known for its high accuracy, ability to handle large datasets with higher dimensionality, and its feature importance capabilities which provide insight into the predictive power of each feature. Here I use this model because it can capture non-linear property of the features. I choose the number of decision trees is 100. I use 5 fold cross validation to evaluate the model's performance. Metrics are Average RMSE, Average MAE, Average R-squared.

In [ ]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, TimeSeriesSplit
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import numpy as np


features = ['Volume', 'MACD', 'Reported EPS', 'Surprise(%)', 'Revenue']
target = 'Close'
# Initialize the Random Forest Regressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)

kf = KFold(n_splits=5, random_state=42, shuffle=True)
rmse_scores = []
mae_scores = []
r2_scores = []

all_y_tests = []
all_predictions = []



for train_index, test_index in kf.split(merged_data):
    X_train, X_test = merged_data[features].iloc[train_index], merged_data[features].iloc[test_index]
    y_train, y_test = merged_data[target].iloc[train_index], merged_data[target].iloc[test_index]

    rf_regressor.fit(X_train, y_train)
    predictions = rf_regressor.predict(X_test)
    mse = mean_squared_error(y_test, predictions)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)
    rmse_scores.append(rmse)
    mae_scores.append(mae)
    r2_scores.append(r2)

    # Store actual and predicted values
    all_y_tests.extend(y_test)
    all_predictions.extend(predictions)

# Calculate average metrics over all windows
average_rmse = np.mean(rmse_scores)
average_mae = np.mean(mae_scores)
average_r2 = np.mean(r2_scores)

print(f"Average RMSE: {average_rmse}")
print(f"Average MAE: {average_mae}")
print(f"Average R-squared: {average_r2}")


# Plot Actual vs. Predicted for aggregated data
plt.figure(figsize=(12, 6))
plt.scatter(all_y_tests, all_predictions, alpha=0.5)
plt.plot([min(all_y_tests), max(all_y_tests)], [min(all_y_tests), max(all_y_tests)], 'k--', lw=2)
plt.xlabel('Actual Close Prices')
plt.ylabel('Predicted Close Prices')
plt.title('Actual vs. Predicted Close Prices - Aggregated')
plt.show()

# Calculate and plot residuals for aggregated data
residuals = np.array(all_y_tests) - np.array(all_predictions)
plt.figure(figsize=(12, 6))
plt.scatter(all_predictions, residuals, alpha=0.5)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Predicted Close Prices')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted - Aggregated')
plt.show()

# Feature importance
importances = rf_regressor.feature_importances_
indices = np.argsort(importances)[::-1]

# Plot the feature importances
plt.figure(figsize=(12,6))
plt.title('Feature Importances')
plt.bar(range(X_train.shape[1]), importances[indices], color="r", align="center")
plt.xticks(range(X_train.shape[1]), [features[i] for i in indices])
plt.xlim([-1, X_train.shape[1]])
plt.show()
Average RMSE: 44.79412216827933
Average MAE: 25.027611104795646
Average R-squared: 0.7767136551479087

The above is the results of Random Forest Regressor. As we see, in the Figure Actual vs. Predicted Close Prices, most of points are around the black line. In the residuals, most of points are around 0. Those results can show the effectivity of the model. For metrics:

  • Average RMSE (Root Mean Square Error): 0.4711
    The RMSE measures the average magnitude of the errors between predicted and actual values. A lower RMSE value is desirable, and in our case, an RMSE of 0.4711 indicates that the model has a relatively low error rate in its predictions.

  • Average MAE (Mean Absolute Error): 0.2627
    MAE provides a linear score of the errors. An MAE of 0.2627 suggests that the model's predictions are, on average, within 0.2627 units of the actual closing price values.

  • Average R-squared: 0.7764
    The R-squared value indicates the proportion of variance for the dependent variable that's explained by the independent variables in the model. An R-squared of 0.7764 demonstrates that approximately 77.64% of the variability in our stock price predictions can be explained by the model, which is considered to be a strong fit.

    More importantly, we output feature importance. It can be found that Reported EPS and MACD are important features in this model. This is consistent with our observations in EDA.

Model2¶

Contextual Bandits: LinUCB¶

Contextual bandit is a Reinforcement Learning approach. We build a context bandit model to optimize trading decisions in the stock market. State, Actions and Reward is defined as following: State (Context): The state consists of stock features of 'Close' price, 'Volume', 'MACD', 'Reported EPS', 'Surprise(%)', and 'Revenue'.

Actions(Arms): The actions is "buy", "sell", or "hold".

Reward: This is our model's core component. I use a get_reward function dictates the immediate reward or penalty received after taking an action (buy, sell, or hold) on a stock. The function carefully assesses the state of the portfolio to make decisions that align with realistic trading scenarios.

Reward of Buy Action: A "buy" action and the stock is not already held in the portfolio, the model simulates a future sale of the stock. I assume a sale after a specified delay, projected by the delay parameter. The reward is the profit from buying at the current price and selling at this future price.

Reward of Sell Action: Conversely, if the "sell" action is chosen and the stock is present in the portfolio, the function calculates the profit from selling the stock based on its buying price. It then updates the portfolio to reflect the sale by resetting the stock's position and buying price.

Reward of Hold Action: If the action is to "hold," no immediate financial change occurs, thus no profit or loss is recorded. This action reflects a strategy of patience, waiting for a more opportune moment to act.

The following is the code for Contextual Bandits. I choose LinUCB as the bandit model. LinUCB stands for Linear Upper Confidence Bound. It's a strategy for making decisions when you're not quite sure what the outcome will be.

In [ ]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


# Normalize the features
scaler = StandardScaler()
feature_columns = ['Close', 'Volume', 'MACD', 'Reported EPS', 'Surprise(%)', 'Revenue']
merged_data[feature_columns] = scaler.fit_transform(merged_data[feature_columns])

# Split the dataset into train and test sets
train_data, test_data = train_test_split(merged_data, test_size=0.4, random_state=42)
print(f"Training Data Shape: {train_data.shape}")
print(f"Testing Data Shape: {test_data.shape}")
Training Data Shape: (1150, 19)
Testing Data Shape: (768, 19)

The following is the LinUCB and reward code. We assume for each stock, we have a position to control our portfolio. If position >0, we can sell or hold. If position = 0, we can buy or hold.

In [ ]:
class LinUCB:
    def __init__(self, alpha, n_arms, n_features):
        self.alpha = alpha
        self.n_arms = n_arms
        self.n_features = n_features
        self.A = [np.identity(n_features) for _ in range(n_arms)]
        self.b = [np.zeros(n_features) for _ in range(n_arms)]


    def recommend_arm(self, x, stock, portfolio):
        p = np.zeros(self.n_arms)
        for arm in range(self.n_arms):
            A_inv = np.linalg.inv(self.A[arm])
            theta = A_inv @ self.b[arm]
            p[arm] = theta.T @ x + self.alpha * np.sqrt(x.T @ A_inv @ x)

        # Adjust probabilities based on portfolio constraints
        has_stock = stock in portfolio and portfolio[stock]['positions'] > 0
        if has_stock:
            p[1] = 0  # Set probability of buying to zero if stock already owned
        else:
            p[2] = 0  # Set probability of selling to zero if stock not owned

        return np.argmax(p)

    def update(self, chosen_arm, x, reward):
        self.A[chosen_arm] += x @ x.T
        self.b[chosen_arm] += reward * x



def get_reward(chosen_arm, stock, current_price, next_price, portfolio, trade_size=1, delay=5):
    profit_loss = 0
    global train_data, index

    # Check if the stock is currently held in the portfolio
    has_stock = stock in portfolio and portfolio[stock]['positions'] > 0

    # Buy logic
    if chosen_arm == 1 and not has_stock:  # buy only if not already holding the stock
        future_index = index + delay if index + delay < len(train_data) else -1
        future_sell_price = train_data.iloc[future_index]['Close'] if future_index != -1 else next_price
        profit_loss = (future_sell_price - current_price) * trade_size
        portfolio[stock] = {'positions': trade_size, 'buying_price': current_price}

    # Sell logic
    elif chosen_arm == 2 and has_stock:  # sell only if the stock is in the portfolio
        profit_loss = (current_price - portfolio[stock]['buying_price']) * portfolio[stock]['positions']
        portfolio[stock] = {'positions': 0, 'buying_price': None}

    # Hold logic
    elif chosen_arm == 0:
        profit_loss = 0  # No profit or loss if holding

    return profit_loss, portfolio
In [ ]:
train_data.reset_index(inplace=True)

# Sort the DataFrame by 'Date' to ensure chronological order
train_data.sort_values(by='Date', inplace=True)

# Reset the index again to integer-based for iteration purposes
train_data.reset_index(drop=True, inplace=True)

# Initialize the bandit algorithm
alpha = 1.5  # Controls exploration-exploitation trade-off
n_arms = 3   # Number of arms: buy, sell, hold
n_features = len(feature_columns)  # Number of features
bandit = LinUCB(alpha, n_arms, n_features)

# Initialize variables to count arm selections and to track portfolio
arm_counts = {'buy': 0, 'sell': 0, 'hold': 0}
portfolio = {}

# Training loop for the bandit on the training set
train_rewards = []

for index, row in train_data.iterrows():
    # Extract the current and next price along with other features
    current_features = np.array(row[feature_columns].values, dtype=np.float64)
    current_price = row['Close']
    next_row = train_data.iloc[index + 1] if index + 1 < len(train_data) else row
    next_price = next_row['Close']
    stock = row['Stock']

    chosen_arm = bandit.recommend_arm(current_features, stock, portfolio)
    arm_counts['buy' if chosen_arm == 0 else 'sell' if chosen_arm == 1 else 'hold'] += 1  # Update arm count

    # Receive reward and update portfolio
    reward, updated_portfolio = get_reward(chosen_arm, stock, current_price, next_price, portfolio)
    portfolio = updated_portfolio

    # Update the bandit
    bandit.update(chosen_arm, current_features, reward)

    # Keep track of the rewards
    train_rewards.append(reward)

print("Arm selection counts:", arm_counts)

# Evaluating on the test set
test_rewards = []
portfolio = {}  # Reset portfolio for testing

for index, row in test_data.iterrows():
    current_features = np.array(row[feature_columns].values, dtype=np.float64)
    current_price = row['Close']
    next_row = test_data.iloc[index + 1] if index + 1 < len(test_data) else row
    next_price = next_row['Close']
    stock = row['Stock']

    chosen_arm = bandit.recommend_arm(current_features, stock, portfolio)

    # Receive reward and update portfolio
    reward, updated_portfolio = get_reward(chosen_arm, stock, current_price, next_price, portfolio)
    portfolio = updated_portfolio

    test_rewards.append(reward)

print(len(test_rewards))

# Compute cumulative rewards for plotting
cumulative_train_rewards = np.cumsum(train_rewards)
cumulative_test_rewards = np.cumsum(test_rewards)
Arm selection counts: {'buy': 536, 'sell': 310, 'hold': 304}
768

We print the arm selection counts, and can find that fits logic since buy actions larger than sell actions. Because we assume when we sell stock, which should be bought before.

In [ ]:
# Plot the cumulative rewards for the training set
plt.figure(figsize=(14, 7))
plt.plot(cumulative_train_rewards, label='Train Cumulative Reward')
plt.xlabel('Time step')
plt.ylabel('Cumulative Reward')
plt.title('Cumulative Reward Over Training')
plt.legend()
plt.show()

# Plot the cumulative rewards for the test set
plt.figure(figsize=(14, 7))
plt.plot(cumulative_test_rewards, label='Test Cumulative Reward', color='orange')
plt.xlabel('Time step')
plt.ylabel('Cumulative Reward')
plt.title('Cumulative Reward Over Testing')
plt.legend()
plt.show()

As we see, Cumulative Reward Over Training and Cumulative Reward Over Testing gradually increase, which proves the model works.

In [ ]:
import numpy as np
import matplotlib.pyplot as plt

# Extract feature importance for each arm
feature_importance = np.zeros((bandit.n_arms, bandit.n_features))
for arm in range(bandit.n_arms):
    A_inv = np.linalg.inv(bandit.A[arm])
    theta = A_inv @ bandit.b[arm]
    feature_importance[arm] = np.abs(theta)  # Using absolute value for importance

# Print feature importance for each arm
for i, arm_name in enumerate(['Hold', 'Buy', 'Sell']):
    print(f"Feature importance for {arm_name} Arm:", feature_importance[i])

# Plot feature importance
fig, axs = plt.subplots(nrows=1, ncols=bandit.n_arms, figsize=(15, 5))
arm_names = ['Hold', 'Buy', 'Sell']
for i, ax in enumerate(axs):
    ax.barh(range(bandit.n_features), feature_importance[i])
    ax.set_yticks(range(bandit.n_features))
    ax.set_yticklabels(feature_columns)
    ax.set_title(f'Feature Importance for {arm_names[i]} Arm')

plt.tight_layout()
plt.show()
Feature importance for Hold Arm: [0. 0. 0. 0. 0. 0.]
Feature importance for Buy Arm: [47.32381935 20.31967445  7.59087696 20.56367162 28.52347963 11.44132221]
Feature importance for Sell Arm: [ 1.37747311  4.47579515 16.49414544 11.36315491  3.23866077  1.21059019]

We can know Feature Importance from above results:

Holding Action When it comes to holding onto stocks, the model indicates a neutral stance across all features, as reflected by an array of zeros. This suggests that none of the features had a significant impact on the decision to hold. This related to our reward design since for hold action, and we design reward is 0 for hold.

Buying Action On the other hand, the decision to buy is influenced heavily by certain features. The model places substantial importance on the 'Close' price, with a value of 47.32, indicating that recent closing prices greatly influence the decision to purchase a stock. The 'Reported EPS' and 'Surprise(%)' also play significant roles, with values of 20.56 and 28.52, respectively, suggesting that earnings information is a strong driver in the purchase of stocks.

Selling Action Conversely, when it comes to selling, the 'MACD' indicator stands out with a feature importance value of 16.49, implying that trends and momentum garnered from MACD significantly inform sell decisions. Interestingly, 'Volume' and 'Reported EPS' also contribute to the decision, albeit to a lesser extent, with values of 4.47 and 11.36, respectively.

5. Conclusion¶

What are the most crucial features that influence stock prices?

Key Features for Model 1: 'Reported EPS' and 'MACD' are most crucial features for stock price prediction, which fits our observation in EDA where Reported EPS has strong relationship with Close price.

Key Features Influencing Stock Prices 'Buy' Action Our result has revealed that certain features stand as pivotal influencers of stock prices 'Buy' Action. Notably, the 'Close' price is a predominant factor, with earnings information such as 'Reported EPS' and 'Surprise(%)' following closely. These elements suggest that past performance and unexpected earnings outcomes are significant to market participants.

Key Features Influencing Stock Prices 'Sell' Action 'MACD' and 'Reported EPS' are most crucial features for Sell Action. That indicates the trend factor influences most and and unexpected earnings outcomes are significant to market participants.

How do these features interact with each other, and what are their relationships?

Price Features (Open, High, Low, Close, Bollinger Upper, Bollinger Lower): These features exhibit perfect positive correlations with each other. This indicates redundancy among these indicators, as they move in lockstep. This is expected as they are all direct measures of stock price movement within a trading session.

Volume and MACD: Volume shows a mild negative correlation with the price features and a slightly stronger negative correlation with MACD. This suggests that higher trade volumes may coincide with downward pressure on the MACD indicator, potentially signaling a reversal or decrease in momentum.

Earnings Per Share (EPS Estimate, Reported EPS): Both EPS Estimate and Reported EPS show strong positive correlations with each other and moderate positive correlations with the price features. This indicates that better-than-expected earnings are typically associated with higher stock prices, reflecting the market's valuation adjustments based on company performance.

Surprise(%): This feature shows little to no correlation with most other features, implying that the surprise element of earnings reports does not consistently move with other market indicators. This could be due to the market's varied reactions to earnings surprises, depending on context and expectations.

Revenue: Like Surprise(%), Revenue shows very weak correlations with other features, suggesting that it has a unique influence on stock prices that is not mirrored by the other indicators in the matrix.

Can we predict future stock prices or do the recommendation based on the crucial features?

Yes, we can. With the Random Forest Regressor and Contextual Bandits model, we demonstrated the potential of using historical data and crucial features to predict stock prices and recommend trading actions. While predictions were not without error, the significant R-squared value obtained from the Random Forest Regressor indicates a strong model fit to the data, suggesting that, to an extent, future stock prices can be anticipated.

The Contextual Bandits model, on the other hand, offered a strategic dimension, recommending actions based on real-time feature analysis. It provided a nuanced approach to stock trading, where not just the prediction of prices, but the timing of actions—buying, selling, or holding—was optimized for potential gains.

In [ ]:
!pip install --upgrade nbconvert
Requirement already satisfied: nbconvert in /usr/local/lib/python3.10/dist-packages (7.12.0)
Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (4.11.2)
Requirement already satisfied: bleach!=5.0.0 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (6.1.0)
Requirement already satisfied: defusedxml in /usr/local/lib/python3.10/dist-packages (from nbconvert) (0.7.1)
Requirement already satisfied: jinja2>=3.0 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (3.1.2)
Requirement already satisfied: jupyter-core>=4.7 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (5.5.0)
Requirement already satisfied: jupyterlab-pygments in /usr/local/lib/python3.10/dist-packages (from nbconvert) (0.3.0)
Requirement already satisfied: markupsafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (2.1.3)
Requirement already satisfied: mistune<4,>=2.0.3 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (3.0.2)
Requirement already satisfied: nbclient>=0.5.0 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (0.9.0)
Requirement already satisfied: nbformat>=5.7 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (5.9.2)
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from nbconvert) (23.2)
Requirement already satisfied: pandocfilters>=1.4.1 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (1.5.0)
Requirement already satisfied: pygments>=2.4.1 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (2.16.1)
Requirement already satisfied: tinycss2 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (1.2.1)
Requirement already satisfied: traitlets>=5.1 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (5.7.1)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.10/dist-packages (from bleach!=5.0.0->nbconvert) (1.16.0)
Requirement already satisfied: webencodings in /usr/local/lib/python3.10/dist-packages (from bleach!=5.0.0->nbconvert) (0.5.1)
Requirement already satisfied: platformdirs>=2.5 in /usr/local/lib/python3.10/dist-packages (from jupyter-core>=4.7->nbconvert) (4.0.0)
Requirement already satisfied: jupyter-client>=6.1.12 in /usr/local/lib/python3.10/dist-packages (from nbclient>=0.5.0->nbconvert) (6.1.12)
Requirement already satisfied: fastjsonschema in /usr/local/lib/python3.10/dist-packages (from nbformat>=5.7->nbconvert) (2.19.0)
Requirement already satisfied: jsonschema>=2.6 in /usr/local/lib/python3.10/dist-packages (from nbformat>=5.7->nbconvert) (4.19.2)
Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.10/dist-packages (from beautifulsoup4->nbconvert) (2.5)
Requirement already satisfied: attrs>=22.2.0 in /usr/local/lib/python3.10/dist-packages (from jsonschema>=2.6->nbformat>=5.7->nbconvert) (23.1.0)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /usr/local/lib/python3.10/dist-packages (from jsonschema>=2.6->nbformat>=5.7->nbconvert) (2023.11.2)
Requirement already satisfied: referencing>=0.28.4 in /usr/local/lib/python3.10/dist-packages (from jsonschema>=2.6->nbformat>=5.7->nbconvert) (0.31.1)
Requirement already satisfied: rpds-py>=0.7.1 in /usr/local/lib/python3.10/dist-packages (from jsonschema>=2.6->nbformat>=5.7->nbconvert) (0.13.2)
Requirement already satisfied: pyzmq>=13 in /usr/local/lib/python3.10/dist-packages (from jupyter-client>=6.1.12->nbclient>=0.5.0->nbconvert) (23.2.1)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.10/dist-packages (from jupyter-client>=6.1.12->nbclient>=0.5.0->nbconvert) (2.8.2)
Requirement already satisfied: tornado>=4.1 in /usr/local/lib/python3.10/dist-packages (from jupyter-client>=6.1.12->nbclient>=0.5.0->nbconvert) (6.3.2)
In [ ]:
%%shell
jupyter nbconvert --to html /content/gdrive/MyDrive/data_science/mile_stone2/milestone2.ipynb
[NbConvertApp] Converting notebook /content/gdrive/MyDrive/data_science/mile_stone2/milestone2.ipynb to html
[NbConvertApp] WARNING | Alternative text is missing on 4 image(s).
[NbConvertApp] Writing 787082 bytes to /content/gdrive/MyDrive/data_science/mile_stone2/milestone2.html
Out[ ]: